32 research outputs found

KERT: Automatic Extraction and Ranking of Topical Keyphrases from Content-Representative Document Titles

Author: Danilevsky Marina
Desai Nihit
Guo Jingyi
Han Jiawei
Wang Chi
Publication date: 02/06/2013

We introduce KERT (Keyphrase Extraction and Ranking by Topic), a framework for topical keyphrase generation and ranking. By shifting from the unigram-centric traditional methods of unsupervised keyphrase extraction to a phrase-centric approach, we are able to directly compare and rank phrases of different lengths. We construct a topical keyphrase ranking function which implements the four criteria that represent high quality topical keyphrases (coverage, purity, phraseness, and completeness). The effectiveness of our approach is demonstrated on two collections of content-representative titles in the domains of Computer Science and Physics. Comment: 9 pages

arXiv.org e-Print Archive

CiteSeerX

Evaluating Robustness of Dialogue Summarization Models in the Presence of Naturally Occurring Variations

Author: Danilevsky Marina
Ganhotra Jatin
Gunasekara Chulaka
Gupta Ankita
Joshi Sachindra
Wan Hui
Publication date: 15/11/2023

The dialogue summarization task involves summarizing long conversations while preserving the most salient information. Real-life dialogues often involve naturally occurring variations (e.g., repetitions, hesitations), and existing dialogue summarization models suffer a performance drop on such conversations. In this study, we systematically investigate the impact of such variations on state-of-the-art dialogue summarization models using publicly available datasets. To simulate real-life variations, we introduce two types of perturbations: utterance-level perturbations that modify individual utterances with errors and language variations, and dialogue-level perturbations that add non-informative exchanges (e.g., repetitions, greetings). We conduct our analysis along three dimensions of robustness: consistency, saliency, and faithfulness, which capture different aspects of the summarization model's performance. We find that both fine-tuned and instruction-tuned models are affected by input variations, with the latter being more susceptible, particularly to dialogue-level perturbations. We also validate our findings via human evaluation. Finally, we investigate whether the robustness of fine-tuned models can be improved by training them with a fraction of perturbed data and observe that this approach is insufficient to address robustness challenges with current models and thus warrants a more thorough investigation to identify better solutions. Overall, our work highlights robustness challenges in dialogue summarization and provides insights for future research.

arXiv.org e-Print Archive

Label Sleuth: From Unlabeled Text to a Classifier in a Few Hours

Author: Aharonov Ranit
Choshen Leshem
Cooper Martin Santillan
Danilevsky Marina
Dankin Lena
Ein-Dor Liat
Epelboim Dina
Gera Ariel
Halfon Alon
Katsis Yannis
Katz Yoav
Li Yunyao
Liberman Naftali
Newton Gwilym
Ofek-Koifman Shila
Shnarch Eyal
Shnayderman Ilya
Slesarev Philip Levin
Slonim Noam
Wang Dakuo
Yip Lucy
Zhang Zheng
Publication date: 02/08/2022

Text classification can be useful in many real-world scenarios, saving a lot of time for end users. However, building a custom classifier typically requires coding skills and ML knowledge, which poses a significant barrier for many potential users. To lift this barrier, we introduce Label Sleuth, a free open source system for labeling and creating text classifiers. This system is unique for (a) being a no-code system, making NLP accessible to non-experts, (b) guiding users through the entire labeling process until they obtain a custom classifier, making the process efficient -- from cold start to classifier in a few hours, and (c) being open for configuration and extension by developers. By open sourcing Label Sleuth we hope to build a community of users and developers that will broaden the utilization of NLP models. Comment: 7 pages, 2 figures

arXiv.org e-Print Archive

SCENE: Structural Conversation Evolution Network

Author: Danilevsky Marina
Publication date: 01/08/2011

It’s not just what you say, but it is how you say it. To date, the majority of the Instant Message (IM) analysis and research has focused on the content of the conversation. The main research question has been, ‘what do people talk about?’ focusing on topic extraction and topic modeling. While content is clearly critical for many real-world applications, we have largely ignored identifying ‘how’ people communicate. Conversation structure and communication patterns provide deep insight into how conversations evolve, and how the content is shared. Motivated by theoretical work from psychology and linguistics in the area of conversation alignment, we introduce SCENE, an evolution network approach to extract knowledge from a conversation network. We demonstrate the potential of our approach by taking the task of matching conversation partners. We find that SCENE is more successful because, in contrast to existing approaches, SCENE treats a conversation as an evolving, rather than a static document, and focuses on the structural elements of the conversation instead of being tied to the specific content.

Illinois Digital Environment for Access to Learning and Scholarship Repository

Graph-based Classification on Heterogeneous Information Networks

Author: Danilevsky Marina
Han Jiawei
Ji Ming
Sun Yizhou
Publication date: 30/04/2010

A heterogeneous information network is a network composed of multiple types of objects and links. Recently, it has been recognized that strongly-typed heterogeneous information networks are prevalent in the real world. Sometimes, label information is available for part of the objects. Learning from such labeled and unlabeled data via classification can lead to good knowledge extraction of the hidden network structure. However, although classification on homogeneous networks has been studied over decades, classification on heterogeneous networks has not been explored until recently. In this paper, we consider the transductive classification problem on heterogeneous networked data which share a common topic. Only part of the objects in the given network are labeled, and we aim to predict labels for all types of the remaining objects. A novel graph-based regularization framework, GNetClass, is proposed to model the link structure in information networks with arbitrary network schema and number of object/link types. Specifically, we explicitly respect the type differences by preserving consistency over each relation graph corresponding to each type of links separately. Efficient computational schemes are then introduced to solve the corresponding optimization problem. Experiments on the DBLP data set show that our algorithm significantly improves the classification accuracy over existing state-of-the-art methods. Unpublished; peer reviewed.

Illinois Digital Environment for Access to Learning and Scholarship Repository

Discovering latent topical phrases in document collections and networks with text components: Leveraging text mining and information network analysis for human oriented applications

Author: Danilevsky Marina
Publication date: 01/05/2014

One of the major challenges of mining topics from a large corpus is the quality of the constructed topics. While phrase-generating approaches generally produce high quality output, they do not scale very well with the size of the data. Thus, the state of the art solutions usually rely upon scalable unigram-generating methods, which do not produce high quality human-readable topics, or are forced to use external knowledge bases. Furthermore, while document collections naturally contain topics at different levels of granularity (general vs. specific), very few traditional methods focus on generating high quality hierarchical topic structures. This dissertation presents a series of approaches that directly address these challenges of generating high quality phrase-based topics, both as a flat set and organized as a hierarchy, as well as some potential applications. First, we describe a framework that generates high-quality topics represented by integrated lists of mixed-length phrases. The key is adapting a phrase-centric view towards the construction and ranking of topical phrases. The approach is domain-independent, and requires neither expert supervision nor an external knowledge base. The framework is initially constructed to work on collections of short texts, such as titles of scientific documents. However, we then show how the framework can be easily and robustly extended to work on collections of longer texts, and demonstrate its applicability to human needs with a task-centric evaluation. The dissertation then addresses the need to move beyond generating a flat set of topics, and presents an approach to constructing hierarchical topics, which extends the phrase-centric approach to create high quality phrases at varying levels of granularity. Another application of this technique is then presented: the task of entity role discovery. By tying entities in a community to topical phrases, users are able to explicitly understand both how and why individual entities are ranked within a specific community. A final extension is then described, which is a combined approach for constructing the hierarchy, which uses entity link information to improve the hierarchy quality.

Illinois Digital Environment for Access to Learning and Scholarship Repository

SCENE: Structural Conversation Evolution NEtwork

Author: Jiawei Han
Joshua Hailpern
Marina Danilevsky
Publication date: 05/05/2012

It’s not just what you say, but it is how you say it. To date, the majority of the Instant Message (IM) analysis and research has focused on the content of the conversation. The main research question has been, ‘what do people talk about?’ focusing on topic extraction and topic modeling. While content is clearly critical for many real-world applications, we have largely ignored identifying ‘how’ people communicate. Conversation structure and communication patterns provide deep insight into how conversations evolve, and how the content is shared. Motivated by theoretical work from psychology and linguistics in the area of conversation alignment, we introduce SCENE, an evolution network approach to extract knowledge from a conversation network. We demonstrate the potential of our approach by taking the task of matching conversation partners. We find that SCENE is more successful because, in contrast to existing approaches, SCENE treats a conversation as an evolving, rather than a static document, and focuses on the structural elements of the conversation instead of being tied to the specific content.

CiteSeerX

Crossref